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1 Introduction 



Let T be some subset of M p , p £ N, and let X = {X (i)) fgT be a stochastic process with 
values in M. Assume that X has zero mean E(X(i)) = for all t € T, and finite covariance 
a (s, t) = E (X (s) X (t)) for all s, t E T. Let ii, . . . ,t n be fixed points in T (deterministic design), 
Xi, ...,Xn independent copies of the process X, and suppose that we observe the noisy processes 

X i {t j )=X i {t j )+£ i {tj) for i = 1, j = l,...,n, (1.1) 
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where £\, ...,£n are independent copies of a second order Gaussian process £ with zero mean 
and independent of X, which represent an additive source of noise in the measurements. Based 
on the noisy observations (jl.ip . an important problem in statistics is to construct an estimator 
of the covariance matrix 5] = E (XX T ) of the process X at the design points, where X = 
(X (ti) , X (i„)) 1 . This problem is a fundamental issue in many applications, ranging from 
geostatistics, financial series or epidemiology for instance (see |Stein, 1999| , |Journel, 1977 



or 



|Cressie, 1993 Wikle and Cressie, 1999| for general references and applications). Estimating 



such a covariance matrix has also important applications in dimension reduction by principal 
component analysis (PCA) or classification by linear or quadratic discriminant analysis (LDA 
and QDA). 

In |Bigot et al., 2010] , using N independent copies of the process X, we have proposed to 
construct an estimator of the covariance matrix S by expanding the process X into a dictionary 
of basis functions. The method in Bigo t et al., 2010] is based on model selection techniques 
by empirical contrast minimization in a suitable matrix regression model. This new approach 
to covariance estimation is well adapted to the case of low-dimensional covariance estimation 
when the number of replicates N of the process is larger than the number of observations points 
n. However, many application areas are currently dealing with the problem of estimating a 
covariance matrix when the number of observations at hand is small when compared to the 
number of parameters to estimate. Examples include biomedical imaging, proteomic/genomic 
data, signal processing in neurosciences and many others. This issue corresponds to the problem 
of covariance estimation for high-dimensional data. This problem is challenging since, in a high- 
dimensional setting (when n >> N or n ~ N), it is well known that the sample covariance 
matrices 



1 N 

s = - x * x -7 G Rnxn . where x ^ = ( x i fa) . •••> x i (*«)) T .» = !» • • • » N 



and 



- X * X * T G Rn>< "> Wllere X * = ( X i (*l) > -i X i (*n)) = 1, 
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behave poorly, and are not consistent estimators of S. For example, suppose that the Xj's are 
independent and identically distributed (i.i.d.) random vectors in R n drawn from a multivariate 
Gaussian distribution. Then, when |->c>0asn,jV-> +oo, neither the eigenvalues nor the 
eigenvectors of the sample covariance matrix S are consistent estimators of the eigenvalues and 
eigenvectors of £ (see | Johnstone, 2001| ). This topic has thus recently received a lot of atten- 
tion in the statistical literature. To achieve consistency, recently developed methods for high- 
dimensional covariance estimation impose sparsity restrictions on the matrix 51. Such restrictions 
imply that the true (but unknown) dimension of the model is much lower than the number n ^" 2 +1 ^ 
of parameters of an unconstrained covariance matrix. Under various sparsity assumptions, dif- 
ferent regularizing methods of the empirical covariance matrix have been proposed. Estimators 
based on thresholding or banding the entries of the empirical covariance matrix have been studied 



in |Bickel and Levina, 2008a and Bickel and Levina, 2008b] . Thresholding the components of 



the empirical covariance matrix has also been proposed by |E1 Karoui, 2008 ] and the consistency 
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of such estimates is studied using tools from random matrix theory. |Fan et al., 2008| impose 
sparsity on the covariance via a factor model which is appropriate in financial applications. 
|Levina et al., 2008 and |Rothman et al., 2008 propose regularization techniques with a Lasso 
penalty to estimate the covariance matrix or its inverse. More general penalties have been stud- 
ied in |Lam and Fan, 2009 . Another approach is to impose sparsity on the eigenvectors of the 



covariance matrix which leads to sparse PCA. |Zou et al., 2006| use a Lasso penalty to achieve 
sparse representation in PCA, |d'Aspremont et al., 2008| study properties of sparse principal 
components by convex programming, while | Johnstone and Lu, 20 09 1 propose a PCA regular- 
ization by expanding the empirical eigenvectors in a sparse basis and then apply a thresholding 
step. 

In this paper, we propose to estimate 5] in a high-dimensional setting by using the assumption 
that the process X has a sparse representation in a large dictionary of basis functions. Using 
a matrix regression model as in [Bi got et al., 2010] , we propose a new methodology for high- 
dimensional covariance matrix estimation based on empirical contrast regularization by a group 
Lasso penalty. Using such a penalty, the method selects a sparse set of basis functions in the 
dictionary used to approximate the process X. This leads to an approximation of the covariance 
matrix 5] into a low dimensional space, and thus to a new method of dimension reduction for 
high-dimensional data. Group Lasso estimators have been studied in the standard linear model 
and in multiple kernel learning to impose a group-sparsity structure on the parameters to recover 
(see |Nardi and Rinaldo, 2008] , |Bach, 2008| and references therein). However, to the best of 
our knowledge, it has not been used for the estimation of covariance matrices using a functional 
approximation of the process X. 

The rest of the paper is organized as follows. In Section [21 we describe a matrix regression 
model for covariance estimation, and we define our estimator by group Lasso regularization. The 
consistency of such a procedure is investigated in Section [3] using oracle inequalities and a non- 
asymptotic point of view by holding fixed the number of replicates N and observation points n. 
Consistency of the estimator is studied in Frobenius and operator norms. Various results existing 
in matrix theory show that convergence in operator norm implies convergence of the eigenvec- 
tors and eigenvalues (e.g. through the use of the sin(#) theorems in |Davis and Kahan, 19 70 1). 
Consistency in operator norm is thus well suited for PCA applications. Numerical experiments 
are given in Section [U and an application to sparse PCA is proposed. A technical Appendix 
contains all the proofs. 



2 Model and definition of the estimator 

To impose sparsity restrictions on the covariance matrix X, our approach is based on an 
approximation of the process in a finite dictionary of (not necessarily orthogonal) basis functions 
g m : T — > K for m = 1, M. Suppose that 

M 

X{t)^J2 a m9m(t), (2.1) 
m=l 
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where a m , m = 1, M are real valued random variables, and that for each trajectory JQ 

M 

Xi (tj) « E a «,m9m • (2.2) 

The notation ~ means that the process X can be well approximated into the dictionary. A 
precise meaning of this will be discussed later on. Then (|2,2p can be written in matrix notation 
as: 

XiwGai, t = l,...,JV (2.3) 
where G is the n x M matrix with entries 

Gj m = g m (tj) for 1 < j < n and 1 < m < M, 

and a, is the M x 1 random vector of components cij im , with 1 < m < M. 

Recall that we want to estimate the covariance matrix S = E (XX T ) from the noisy obser- 
vations (jl.ip . Since X ~ Ga with a = (o, m ) 1<rn<M with a m as in (|2.ip . it follows that 

E » E (Ga(Ga) T ) = E ^Gaa T G T ) = G**G T with = E (aa T ) . 

Given the noisy observations Xj as in (jl.ip with i = 1, ...,N, consider the following matrix 
regression model 

X i x7 = s + U i + W i i = l,...,JV, (2.4) 
where Uj = XjXj 1 " — S are i.i.d centered matrix errors, and 

W, = SiSj e M nxn where % = (Si (h) , (t n )) T , i = 1, . . . , N. 

The size M of the dictionary can be very large, but it is expected that the process X has a sparse 
expansion in this basis, meaning that, in approximation (|2,ip . many of the random coefficients 
a m are close to zero. We are interested in obtaining an estimate of the covariance XI in the 
form X! = G\I/G T such that ^ is a symmetric M X M matrix with many zero rows (and so, by 
symmetry, many corresponding zero columns). Note that setting the fc-th row of * to G R M 
means to remove the function from the set of basis functions (5m)i< m <Af m the function 
expansion associated to G. 

Let us now explain how to select a sparse set of rows/columns in the matrix . For this, we 
use a group Lasso approach to threshold some rows / columns of \I/ which corresponds to removing 
some basis functions in the approximation of the process X. For two p x p matrices A, B define 
the inner product (A,B)^ := tr (A T B) and the associated Frobenius norm ||A|||, := tr (A T A). 
Let Sm denote the set of M x M symmetric matrices with real entries. We define the group 
Lasso estimator of the covariance matrix XI by 

E A = G$ A G T G M nxn , (2.5) 

where ^\ is the solution of the following optimization problem: 

M 




* A = argmin { — > ^ HX^ 1 - G*G 



Ik 

k=l 



M 



m=l 
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where \l/ = (*mfc)i< m KA/ e R x , A is a positive number and 7^ are some weights whose 
values will be discuss later on. In (|2.6[) . the penalty term imposes to give preference to solutions 
with components *S?k = 0, where (^fc)i<fc<M denotes the columns of Recall that S = 

N „ „ 

^2 XjXj denotes the sample covariance matrix from the noisy observations (jl.ip . It can be 

i=l 

checked that minimizing the criterion (|2,6p is equivalent to 



argmm 



S - G*G 



A/ 

fc=i 



2 

771 fc 



(2.7) 



Thus £ j£- MX - M can De interpreted as a group Lasso estimator of X! in the following matrix 
regression model 



S = S + U + Ww G**G 1 + U + W, 



(2.1 



A' 



where U G M raxn is a centered error matrix given by U = ^ X^i=i U» and W = ^ Wj. In 

^ * =1 
the above regression model (|2.8|) . there are two errors terms of a different nature. The term W 

corresponds to the additive Gaussian errors £\, ...,£at in model (jl.ip . while the term U = S — S 
represents the difference between the (unobserved) sample covariance matrix S and the matrix 
S that we want to estimate. 

This approach can be interpreted as a thresholding procedure of the entries of an empir- 
ical matrix. To see this, consider the simple case where M = n and the basis functions and 
observations points are chosen such that the matrix G is orthogonal. Let Y = G T SG be 
a transformation of the empirical covariance matrix S. In the orthogonal case, the following 
proposition shows that the group Lasso estimator *S?\ defined by (|2.7p consists in thresholding 
the columns / rows of Y whose i^-norm is too small, and in multiplying the other columns / rows 
by weights between and 1. Hence, the group Lasso estimate (|2,7p can be interpreted as 
covariance estimation by soft-thresholding the columns/rows of Y. 

Proposition 1 Suppose that M = n and that G T G = I n where I n denotes the identity matrix 
of size n x n. Let Y = G T SG. Then, the group Lasso estimator defined by {2. 7| ) is the 
n x n symmetric matrix whose entries are given by 



7/7/1' 







K 



ink 



Y 2 

J = l 1 mfe 



if 
if 



Y 2 
1 x jk 



(2.9) 



for 1 < k,m < M. 



3 Consistency of the group Lasso estimator 

3.1 Notations and main assumptions 

Let us begin by some definitions. For a symmetric pxp matrix A with real entries, p m i n (A) 
denotes the smallest eigenvalue of A, and p max (A) denotes the largest eigenvalue of A. For 
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j3 G M q , \\/3\\e 2 denotes the usual Euclidean norm of /3. For p x q matrix A with real entries, 
|| A||2 = sup iSgK9j p^LQ ^^\\^ 2 denotes the operator norm of A. Recall that if A is a non negative 
definite matrix with p = q then ||A||2 = p ma x-(A). 

Let \I/ £ Sm and j3 a vector in M. M . For a subset J C {1, . . . , M} of indices of cardinality 
I J|, then f5j is the vector in R M that has the same coordinates as (3 on J and zeros coordinates 
on the complement J c of J. The n x | J| matrix obtained by removing the columns of G whose 
indices are not in J is denoted by Gj. The sparsity of \1/ is defined as its number of non-zero 
columns (and thus by symmetry non-zero rows) namely 

Definition 1 For \& 6 Sm, the sparsity of \& is 

Then, let us introduce the following quantities that control the minimal eigenvalues of sub- 
matrices of small size extracted from the matrix G T G, and the correlations between the columns 
of G: 



Definition 2 Let < s < M. Then, 

Pmin(s) := mf J — 2 = mf p min [GjGj 

JC{1,...,M} V / JC{1 M} V 

|J| < s |J| < s 

Definition 3 T/ie mutual coherence 0(G) of the columns G k , k = 1, . . . , M of G is defined as 

6(G) := max { Gj,G k , k ^ k', 1 < k, k' < m\ , 

and let 

G max :=niax{||G fe ||!, 1 < k < M} . 

To derive oracle inequalities showing the consistency of the group Lasso estimator the 
correlations between the columns of G (measured by #(G)) should not be too large when com- 
pared to the minimal eigenvalues of small matrices extracted from G T G, which is formulated 
in the following assumption: 



Assumption 1 Let cq > be some constant and < s < M . Then 

C0Pmax(G G)S 

Assumption [T] is inspired by recent results in [Bickel et al., 2009] on the consistency of Lasso 
estimators in the standard nonparametric regression model using a large dictionary of basis 
functions. In |Bickel et al., 2009| , a general condition called restricted eigenvalue assumption is 
introduced to control the minimal eigenvalues of the Gram matrix associated to the dictionary 
over sets of sparse vectors. In the setting of nonparametric regression, a condition similar to 
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Assumption [T] is given in | Bic kel et al., 2009] as an example for which the restricted eigenvalue 
assumption holds. 

Let us give some examples for which Assumption [1] is satisfied. If M < n and the design 
points are chosen such that the columns of the matrix G are orthonormal vectors in M n , then 
for any < s < M one has that Pmin 

(s) = 1 and 6(G) = and thus Assumption [T] holds for any 

value of Co and s. 

Now, suppose that the columns of G are normalized to one, i.e ||Gjfe||^ 2 = 1, k = 1,...,M 
implying that G max = 1. Let j3 £ M M . Then, for any J C {1, ... , M} with \ J\ < s < min(n, M) 

P T jG t GPj>Wj\\1-6(G)s\\Pj\\1, 

which implies that 

Pmin(s) > 1 " 6(G)S. 

Therefore, if (1 — 6(G)(s — l)) 2 > co6(G)p majX (G T G)s, then Assumption [1] is satisfied. 

Let us now specify the law of the stochastic process X. For this, recall that for a real-valued 
random variable Z, the tp a Orlicz norm of Z is 

\\Z\\^ a := inf |c> ; Eexp < 2 

Such Orlicz norms are useful to characterize the tail behavior of random variables. Indeed, if 
H^lUa < t nen this is equivalent to assuming that there exists two constants K\,K>i > 
such that for all x > 



|Z|>z)<^ ie xp (-Jj), 



(see e.g. Mendclson and Pajor, 2006 for more details on Orlicz norms of random variables) . 
Therefore, if ||Z||^ 2 < +oo then Z is said to have a sub-Gaussian behavior and if H^H^ < +oo 
then Z is said to have a sub-Exponential behavior. In the next sections, oracle inequalities for 
the group Lasso estimator will be derived under the following assumption on X: 

Assumption 2 The random vector X = (X (ti) , X (t n )) T £ R n is such that 

(Al) There exists p(S) > such that, for all vector (3 G W 1 with ||/3||^ 2 = 1, then 
(E|X T /3| 4 ) 1/4 </?(£)• 

(A2) Set Z = ||X||^ 2 . There exists a > 1 such that \\Z\\^ a < +oo. 

Note that (Al) implies that ||S||2 < p(S) 2 . Indeed, one has that 

||S|| 2 =aw(S) = sup /3 T S/3= sup E/3 T XX T /3 
/3eR", \\3h =l 8m n , \W\\t =l 



sup E|/3 T X| 2 < sup A /E|/3 T X| 4 < p 2 (S) 

/3eR", MU 2 =i /3e« n , ||/3|U 2 =i 
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When X is a Gaussian process, it follows that for any f3 6 W 1 with \\f3\\i 2 = 1 then 

(E|X T f3\ 4 ) 1/4 = 3 1 / 4 (/3 T S/3) 1/2 since X T /3 ~ iV(0, /3 T £/3). Therefore, under the assumption 

that X is a Gaussian process, Assumption (Al) holds with /o(S) = S 1 / 4 ^!!^ 2 . 

Assumption (A2) requires that H-Z'll^ < +00, where Z = ||X||^ 2 . The following proposition 
provides some examples where such an assumption holds. 

Proposition 2 Let Z = \\X.\\ h = (YZ=i l X (^)| 2 ) 1/2 - Then 
- If X is a Gaussian process 



\\Z\\to < V8/3VMS)- 

If the random process X is such that \\Z\\^ 2 < +00, and there exists a constant C\ such that 

— 1/2 

IjS^ |A(tj)| ||^, 2 < C\ for all i = 1, . . . ,n, then 



\\Z\lto < Ci^tr(S). 

- If X is a bounded process, meaning that there exists a constant R > such that for all t G T, 
|A(i)| < R, then for any a > 1, 

\\ZU a < ^(lo g 2)- 1 /-. 

Assumption [2] will be used to control the deviation in operator norm between the sample 
covariance matrix S and the true covariance matrix S in the sense of the following proposition 
whose proof follows from Theorem 2.1 in |Mendelson and Pajor, 2 006 . 

Proposition 3 Let X±, be independent copies of the stochastic process X, let Z = ||X||£ 2 

T N 

and Xj = (Xi (t\) , X{ (t n )) for i = 1,...,N. Recall that S = ^ XjX^~ and S = 

i=i 

E(XX T ). Suppose that X satisfies Assumption^ Let d = min(n, N). Then, there exists 
a universal constant 5* > such that for all x > 

P (||S - s|| 2 ^ T diN , n x) ^ exp (-(S^x)^ , (3.1) 
where tn,u = max (Ajy n , B^^ n ), with 

Vte£d(logNy/« p?(2) 1/2 

An,u = 7= and £>N.n = — j=r- + ^ 2 A N,n- 



Let us briefly comment Proposition [3] in some specific cases. If X is Gaussian, then Propo- 
sition [2] implies that A/v,n < A7v,n,i, where 

a w = vmV^) ^ d( ^ N)1/a < Vm \m\l /2 ^V^~d(io gN )^, (3 . 2 ) 



s 



and in this case inequality (|3.ip becomes 



S-S 



> max (A^ !n l ,Sjv,n,i) s J < exp (-(5* x :c) 2+a 



(3.3) 



for all x > 0, where B N , nyl = + ||S| 

If X is a bounded process by some constant R > , then using Proposition [2] and by letting 
a —> +oo, Proposition [3] implies that for all x > 0, 



1 1/ 2 yi 

1 2 AW,n,l- 



( 



S-S 



^ max 



(^AT,n,2 5 -B^n^) x) ^ exp i 



-CM 



where 



.4 



7V,n,2 



ylogd and B N>n>2 



p 2 (S) 



A/ 



+ IIS 



I 1 / 2 4 

1 2 A/V,n,2- 



(3.4) 



(3.5) 



Contrary to the low-dimensional case (n < < N) , in a high-dimensional setting when n >> N 
or when n and A" are of the same magnitude {jt — ^ c > as ra, Af — )• +oo), inequalities f|3.3[) and 

(|3.4|) cannot be used to conclude that the norm S — 5] concentrates around zero. Actually, 

it is well known that the sample covariance S is a bad estimator of S in a high-dimensional 
setting, and that without any further restriction on the structure of the covariance matrix S, 
then S cannot be a consistent estimator. However, we would like to point out that Proposition [3] 
relates the quality of S to the "true dimensionality" of the vector X = (X (t\) , X (t n )) T £ W 1 



that is measured by the quantity 
Gaussian process such that tr(E) ■ 



with Z 



X 



Indeed, if AT is a low-dimensional 



S-S 



1 then Proposition [3] and inequality (|3.2p imply that 
■ max (A 2 N , Bn) < exp ^— (5~ 1 x)z 



(3.6) 



for all x > 0, where Aj 



W = y 8/3 ^'° gjV ^|^ - aim n N = ^y- -f ||z^|| 2 j± N . 
(|3.6p shows that, under an assumption of low-dimensionality of the process X, the deviation in 
operator norm between S and S depends on the ratio i and not on jj, and thus the quality of 
S as an estimator of S is much better in such settings. 

More generally, another assumption of low-dimensionality for the process X is to suppose 
that it has a sparse representation in a dictionary of basis functions, which may also improve 
the quality of S as an estimator of S. To see this, consider the simplest case X = X°, where 
the process X° has a sparse representation in the basis {g m )i<m<M given by 



and Bp 



P 2 (S) 



nl/2 

+ ||S|| 2 An. Hence, inequality 



X°(t) 



y~] a m g m (t), t € T, 

me J* 



(3.7) 



where J* C {1, . . . , M} is a subset of indices of cardinality \J*\ = s* and a m , m £ J* are random 
coefficients (possibly correlated). Under such an assumption, the following proposition holds. 

Proposition 4 Suppose that X = X° with X° defined by \3. 7\ ) with < min(n, M). Assume 
that X satisfies Assumption^ and that the matrix Gj.Gj* is invertible, where Gj* denotes the 
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n x | J* | matrix obtained by removing the columns of G whose indices are not in J* . Then, there 
exists a universal constant 5* > such that for all x > 0, 



S - £ 



2* tn,s«x) < exp (-(CM^) , (3.8) 



where fjv, s » =max (^4Ar s )-B/v,s*)> ™^ 



4 nV2 \ || v|| Vl^aogiV) 1 / 



and 



/w(Gj,Gj»)\ p 2 (S) ( p max {Gj.G 




N "* ~ (GJ. G7) J "W ( G I* G P 1 " " 2 V " 

with d* = mm(N,s*) and Z = ||aj*||^ 2 , u>/iere aj* = (Gj» Gj* ) -1 Gj»X G 

Using Proposition [5] and Proposition H] it follows that 

- If X = X° is a Gaussian process then 

1 /2 

^^( pI|g£I ) ii S ii^yi^oo g iv)v« ( 3,) 

- If X = X° is such that the random variables a m are bounded by for some constant R > 0, 

then 

A N , S * < RMooJ^y/logd* (3.10) 
with H^Hoo = maxi< m < M ||g m ||oo where ||# m ||oo = sup tgT \g m {t)\. 



Therefore, let us compare the bounds (|3,9p and (|3.10p with the inequalities (|3.2[) and ()3.5p . 
It follows that, in the case X = X°, if the sparsity s* of X in the dictionary is small compared 
to the number of time points n then the deviation between S and S is much smaller than in the 
general case without any assumption on the structure of S. Obviously, the gain also depends on 

the control of the ratio Pmax (^*^ tj ' ) _ Note that in the case of an orthonormal design (M = n 

and G T G = I n ) then p max (Gj.G j*) = p m - m (Gj.Gj*) = 1 for any J*, and thus the gain in 
operator norm between S and 5] clearly depends on the size of ^ compared to jfy. Supposing 
that X = X° also implies that the operator norm of the error term U in the matrix regression 
model (|2.8p is controlled by the ratio % instead of the ratio -8= when no assumptions are made 
on the structure of XI . This means that if X has a sparse representation in the dictionary then 
the error term U becomes smaller. 
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3.2 An oracle inequality for the Frobenius norm 



1 1 1 2 

Consistency is first studied for the normalized Frobenius norm - HA))^ for an n x n matrix 
A. The following theorem provides an oracle inequality for the group Lasso estimator H\ = 
G* A G T . 

Theorem 1 Assume that X satisfies Assumption^ Let e > and 1 < s < min(n, M). Suppose 
that Assumption^ holds with c$ = 3 + 4/e, and that the covariance matrix S no j se = E (Wi) of 
the noise is positive- definite. Consider the group Lasso estimator S A defined by \2. 5\) with the 
choices 

Ik = 2 l|Gfc||^ 2 A/ p max (GG T ), 



and 



A 



•'noise || 2 



1 + 



+ 



25 log M 



N V N I 

Then, with probability at least 1 — M 1 ~ s one has that 



for some constant 5 > 1. 



< + 



+C(e 



inf 
* G 5m 
(*) < s 



p2 

"maxFmax 



Pmax(G T G) 



G*G 



H — || S 

F n 



J noise\\2 



1 + 



+ 



iV 



(3.11) 



2<5 log M\ M(*) 



w/iere ^ 



c fl(G)p max (G T G) S , and C(e) = 8^(1 + 2/e) 2 . 



The first term ± ||G*G 



^\\ F in inequality ()3,lip is the bias of the estimator H\. It 
reflects the quality of the approximation of X! by the set of matrices of the form G\I/G T , with 
\l/ G Sm and .M (*) < s. As an example, suppose that X = X°, where the process X° has a 
sparse representation in the basis (g m )i<m<M given by 

mS J* 

where J* C {1, . . . , M} is a subset of indices of cardinality \J*\ = s* < s and a m ,m G J* are 
random coefficients. Then, in this case, since s* < s the bias term in f)3.11j) is equal to zero. 
The second term i ||S — S||^ in (|3.1ip is a variance term as the empirical covariance matrix 

In 1 1 2 1 1 2 

S is an unbiased estimator of S. Using the inequality - ||^4||p < \\A\\2 that holds for any n x n 



matrix A, it follows that 



IS - Silt < IIS -S| 



Therefore, under the assumption that X has 



a sparse representation in the dictionary (e.g. when X = Xq as above) then the variance term 
^ || S — X||^ is controlled by the ratio < (see Proposition H]) instead of the ratio -8* without 
any assumption on the structure of S. 

The third term in (|3.1ip is also a variance term due to the noise in the measurements (jl.ip . 
If there exists a constant c > independent of n and N such that < c then the decay of this 
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third variance term is essentially controlled by the ratio — -j—*- < ^. Therefore, if Ai < s 
with sparsity s much smaller than n then the variance of the group Lasso estimator Y]\ is smaller 
than the variance of S. This shows some of the improvements achieved by regularization (|2,7p 
of the empirical covariance matrix S with a group Lasso penalty. 

An important assumption of Theorem [1] is that the covariance matrix of the noise ~E, no i se = 
E(Wi) is positive definite. This restriction is clearly necessary as illustrated by the following 
example: suppose that the contaminating process £ (t) = (gi(t) with ( ~ N(0, a\), implying 
that 5j no j se = crfgigj with gi = (gi(h), ... ,gi (i„)) T has n— 1 eigenvalues equal to zero. 
Now, suppose that X(t) = a2<?2(i) with a>2 ~ iV(0, erf). If o~\ > &2 then the group LASSO 
regularization alone cannot get rid of the additive error term without eliminating first the right 
component g<i- Hence, in such settings, group LASSO regularization does not yield to a consistent 
estimation of £ = cr|g 2 g 2 r with g 2 = (52 (*i ),■■■, 52 (*n)) T - 



3.3 An oracle inequality for the operator norm 



The "normalized" Frobenius norm - 

n 



i.e the average of the eigenvalues of 



mum eigenvalue of I X 



can be viewed as a reasonable proxy for the operator norm 
2 



(maxi- 



It is thus expected that the results of Theorem [T] imply that 

the group Lasso estimator H\ is a good estimator of X in operator norm. Let us recall that 
controlling the operator norm enables to study the convergence of the eigenvectors and eigen- 
values of £a by controlling of the angles between the eigenspaces of a population and a sample 
covariance matrix through the use of the sin(#) theorems in |Davis and Kahan, 19 70 . 

Now, let us consider the case where X consists in noisy observations of the process X° (|3,7p 
meaning that 

X(tj) = X (tj) + £ (tj) , j = 1, . . . , n, (3.12) 

where £ is a second order Gaussian process £ with zero mean and independent of X°. In this 
case, one has that 



S = G**G T , where ** = E (aa 1 



where a is the random vector of M M with a m = a m for m G J* and a m = for m ^ J*. 
Therefore, using Theorem Q] by replacing s by s* = |J*|, since € G Sm ■ M < s*}, 
one can derive the following corrollary: 

Corollary 1 Suppose that the observations Xi(tj) with i = 1, ...,N and j = 1, . . . ,n are i.i.d 
random variables from model L3.12\) and that the conditions of Theorem [7] are satisfied with 
1 < s = < min(n, M). Then, with probability at least 1 — M 1-5 one has that 



^noise) > 



(3.13) 



where 

C (n,M, N, s*,S,**,G,£ r 



= (1+*) - 
n 



S-G**G 



+ C(e) 



G max Pmax(G T G) 2 s 



St., co 



A — 

n , 
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To simplify notations, write \1/ = *f?\, with given by (|2,7p . Define J\ C {1, . . . , M} as 



/ : : ^= 



> Ci (n,M,A^,s,,S,**,G,5] noise ) ^ , with 5 k 



T k\\t 2 



(3.14) 



and Ci (n,M, N, s*, S,**, G, S r 



Ci with 



C 



i = max 7- 



-1/2 



1 + e 



A 



S-G**G 



2 ; 4 (1 + e) v 7 ^ ^ (yt) M> jy- — 



G, S r 



_ (3-15) 
with 7 m ax = 2G max y / p max (G T G). The set of indices J is an estimation of the set of active 
basis functions J* . Note that such thresholding procedure (|3.14p does not lead immediately to 
a practical way to choose the set J. Indeed the constant C\ in f|3. 14|) depends on the a priori 
unknown sparsity s* and on the amplitude of the noise in the matrix regression model (j2.8j) 
measured by the quantities ^ ||S — G\1/*G T ||„ and ||S„ j se [||. Nevertheless, in Section 0] on 
numerical experiments we give a simple procedure to automatically threshold the ^-norm of the 
columns of the matrix ty\ that are two small. 



Note that to estimate J* we did not simply take J = Jq 



k : 



7^ > , but rather 



apply a thresholding step to discard the columns of \1/ whose ^-norm are too small. By doing 
so, we want to stress the fact that to obtain a consistent procedure with respect to the operator 
norm it is not sufficient to simply take J = Jq. A similar thresholding step is proposed in 



Lounici, 2008 and Lounici et al., 2009 in the standard linear model to select a sparse set 



of active variables when using regularization by a Lasso or group-Lasso penalty. In the paper 
( |Lounici, 2008| ), the second thresholding step used to estimate the true sparsity pattern depends 
on a unknown constant that is related to the amplitude of the unknown coefficients to estimate. 
Then, the following theorem holds. 



Theorem 2 Under the assumptions of Corollary [7J for any solution of problem (2.1), we have 
that with probability at least 1 — M 1_<5 , 



o~k 

max —= 

Kk<M\ n 



< Ci (n, M, N, s*,S,**,G,:£ r 



(3.16) 



If in addition 



mm 

keJ* 



I* 



n 



k\\e 2 



> 2d (n,M,N, s*,S,**,G,£ n 



(3.17) 



then with the same probability the set of indices J, defined by ), estimates correctly the true 
set of active basis functions J* , that is J = J* with probability at least 1 — M 1_<5 . 



The results of Theorem [2] indicate that if the ^-norm of the columns of for k £ J* 



are sufficiently large with respect to the level of noise in the matrix regression model (|2.8p 
and the sparsity s*, then J is a consistent estimation of the active set of variables. Indeed, if 
Ai (VP*) = s*, then by symmetry the columns of VP* such VP*. 7^ have exactly s* non-zero 
entries. Hence, the condition (|3.17p means that the ^2-norm of VP*, 7^ (normalized by -^j=) has 
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to be larger than 4 ( 1+e ) \/s^^/Cq. A simple condition to satisfy such an assumption is that the 

amplitude of the s* non- vanishing entries of \l/t are larger than ^ 4 f 1+< ^ \ZCq which can be 
interpreted as a kind of measure of the noise in model (|2.8p . This suggests to take as a final 
estimator of S the following matrix: 



(3.18) 



where Gj denotes the n X | j\ matrix obtained by removing the columns of G whose indices are 
not in J, and 



j = argmm 
*6S |-?| 



S - Gj*Gj 



where S,j, denotes the set of | J\ X | J\ symmetric matrices. Note that if G^G f is invertible, then 



gJsGj (gJGj 



Let us recall that if the observations are i.i.d random variables from model (|3.12p then 

£ = G**G T , 

where ** = E ( aa. - ' - ') , and a. is the random vector of with a m — a m for m£ J* and a m = 
for m ^ J*. Then, define the random vector aj* G M. whose coordinates are the random 
coefficients a m for m G J*. Let ^j* = E(aj*aJ») and denote by Gj* the n x |J*| matrix 
obtained by removing the columns of G whose indices are not in J* . Note that S = G j* \l/ j* G J, . 



Assuming that Gj*Gj* is invertible, define the matrix 



Sj* = £ + Gj*(Gj,Gj») 1 Gj»S no j se Gj* ( G)»Gj* ) Gj, 



-i 



(3.19) 



Then, the following theorem gives a control of deviation between S ; and X! j* in operator norm. 



Theorem 3 Suppose that the observations are i.i.d random variables from model A3.12\) and 
that the conditions of Theorem [7] are satisfied with 1 < s = s* < min(n,M). Suppose that 
Gj.Gj* is an invertible matrix, and that 

min^= ||*£L > 2d (n, M, N, s*, S,**, G, £ noise ) , 



w/iere Ci (n, M, iV, s*, S,**, G, £ r 



(Gj.Gj* 



1 GT, X and Z 



\ e 2 



is the constant defined in i3.15\) . Let Y 
Let p(Ti noise ) = ( sup^gjjnjj^i 



£ = (£ (t\) , ■ Then, with probability at least 1 — M 

5* > <L one has that 



1-<5_M V<W 



iEI^H ' w/iere 
, with 5 > 1 and 



S,, 



< /W (Gj.Gj.) (log(M))— , 



(3.20) 
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where f N>St = max(A 2 Nst , B N)S J, with A N)S , = \\Z\\^ a ^^^^^—, B NjSt = 
^(i:,E„ ri , e )| n (G} t G J .) + (||^|| 2 + /0m i n(G T Gjt) \\v noise \\ 2 f 2 A N; ^, where d* = xmn(N,s*) 
and p(S, £„) = 8 x /4 ( p 4 ( S) + ^4 (£ noise )) 1/4 . 

First note that the above theorem gives a deviation in operator norm from 5]j to the matrix 
XI j* (|3.19p which is not equal to the true covariance 5] of X at the design points. Indeed, 
even if we know the true sparsity set J* , the additive noise in the measurements in model (|l,ip 
complicates the estimation of S in operator norm. However, although Ej* ^ S, they can have 
the same eigenvectors if the structure of the additive noise matrix term in ()3. 19|) is not too 
complex. As an example, consider the case of an additive white noise, for which Yl no i se = cr 2 I n 
where a is the level of noise and I n the n x n identity matrix. Under such an assumption, if we 
further suppose for simplicity that (Gj.Gj*) -1 = I s *, then S j. = E+(j 2 Gj* (Gj.Gj*) - ^^ = 
X! + cr 2 I n and clearly and XI have the same eigenvectors. Therefore, the eigenvectors of 
Xlj can be used as estimators of the eigenvectors of XI which is suitable for the sparse PCA 
application described in the next section on numerical experiments. 

Let us illustrate the implications of Theorem [3] on a simple example. If X is Gaussian, the 
random vector Y = (Gj*Gj*) Gj, (X + £) is also Gaussian and Proposition [2] can be used 
to prove that 



||% 2 < VmJtr (( G J. Gj*) l Gj. (S + S„ oise ) Gj* (Gj.Gj.; 
< 7873||S + V noise \\l /2 p-V 2 (Gj.Gj.) v 7 ^- 
Then Theorem [3] implies that with high probability 



2 



< Pmax ( Gj*Gj* J TiV, s »,l<5 (log(M)) 



2+a 



where fjv,a.,i = max (^Ar, s ,,,i> B NiStil ), with 

= TsTsiis + ^noiseWTp^L 2 (Gj.Gj,) Tbi^aogiV) 1 /-^ 



and 



r - ^ 2 ( S ' S "^)^min( G J« G J') | f\ m \\, n -ifr T r \ iiv nV /2 i 

£>N,s„l — 7= 1" I ||*J*||2 + Pmin I ^J*Gj* ) H^noiselh ) A/V>„1. 



N 

Therefore, in the Gaussian case (but also under other assumptions for X such as those in 

2 



depends on the 

2 



Proposition [2]) the above equations show that the operator norm 

ratio Recall that ||S — depends on the ratio jj. Thus, using clearly yields significant 
improvements if s* is small compared to n. 

To summarize our results let us finally consider the case of an orthogonal design. Combining 
Theorems [TJ [2] and [3] one arrives at the following corrolary: 
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Corollary 2 Suppose that the observations are i.i.d random variables from model liS. 12) . Sup- 
pose that M = n and that G T G = I n (orthogonal design) and that X° satisfies Assumption^ 
Let e > and 1 < < min(n, M). Consider the group Lasso estimator T,\ defined by l27l 
with the choices 



jk = 2, k = 1, . . . , n and A 
Suppose that 



,, . , ,n 25logM . , 
I ^"noise || 2 I 1 + \l + y ^ ) J or some constant o > 1. 



N 



min > 2n 1/2 Ci (er, n, s*, N, 5) , 



keJ* 



(3.21) 



where C\ (a, n, s, N, 5) = 4 ( 1+ ^)v^* / (j Q ^ n ^ S]|t; _/v, £) anc / 



C (a,n,s,,7V,5) = (1+e) - 

n 



S-G**G T 



+ C(e)||S no j se ||2 



,n 28\ogM\ s 



Take J := < k : 







{, 





> n^ 2 Ci (a,n,s,N,S) \ . Let Y = Gj*X and Z = \\Y\\e 2 . Then, with 



probability at least 1 — M 



1-8 



M v 5 */ with 5 > 1 and 5+ > 5* one has that 



where fjv, s , = max(74^ r , i , -Bjv>*)> with An r 

(II* J* II2 + l|S n oise|| 2 ) 1//2 4/V,s»- 



<f7v, s ,^(log(M)) 2 « a 



l^lk ^ — andB N>Sf 



(3.22) 



p 2 (S,S nQ j se ) 



+ 



3.4 Comparison with the standard Lasso 

In this work, we chose a Group Lasso estimation procedure rather than a standard Lasso. As 
a matter of fact, for covariance estimation in our setting, the group structure enables to impose 
a constraint on the number of non zero columns of the matrix ^ and not on the single entries of 
the matrix This corresponds to the natural assumption of obtaining a sparse representation 
of the process X(t) in the basis given by the functions g m 's and replacing its dimension by its 
sparsity. Alternatively, the standard Lasso in our setting would be the estimator defined by 



iJ?L = argmin 
*eS M 



S-G*G 



T 



M M 

+ ^ lmk\^mk\ 

k=l m=l 



where A > is a regularization parameters and the 7 m fc's are positive weights. This procedure 
leads to the following Lasso estimator of the covariance matrix S 



G* L G' € 



(3.23) 



In the orthogonal case (i.e. M = n and G T G = I n ), this gives rise to the estimator obtained 
by soft thresholding individually each entry Y m k of the matrix Y = G T SG with the thresholds 
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^Imk- Proposition [5] (see below) allows a simple comparison of the statistical performances of 
the group Lasso estimator 5] l with those of the standard Lasso estimator H\ in terms of upper 
bounds for the Frobenius norm. To simplify the discussion, we only consider the orthogonal 
case and the simple model 



X{t j )=X°(t j )+£ (tj), j = l,...,n, 



(3.24) 



where the process X° is defined in (|3.7p . The statement of the result for the group Lasso is 
an immediate consequence of Theorem [lj while the proof to obtain the upper bound for the 
standard Lasso is an immediate adaptation of the arguments in the proof of Theorem [TJ 



Proposition 5 Assume that X satisfies model (|3.24|) and that the covariance matrix Yl no i se = 
E (Wi) of the noise is positive-definite. Consider the group Lasso estimator 5] a and the standard 
Lasso estimator XI £ with the choices 



Ik = 2, 7; 



ink 



2, A 



J noise\\2 



2 + 



25 log M 



N 



for some constant 5 > 1. 



Then, there exist two positive constants C\,C2 not depending on n,N,s* such that with proba- 
bility at least 1 — M 1 ^ 5 one has that 



Ci 



< 

F n 



IS -E||i + CallE 



■ 1 2 

noise\\2 



2 + 



25 log n \ 



N 



n 



and 



Ci 



< 

F n 



IS -S||i + CallS 



noise\\2 




Proposition [5] illustrates the advantages of the Group Lasso over the standard Lasso. Indeed, 
the second term in the upper bound for the group Lasso is much smaller (of the order — ) than 

2 

the second term in the upper bound for the standard Lasso (of the order — ). This comes from 
the fact that the sparsity prior of the Group Lasso is on the number of vanishing columns of the 
matrix ^f, while the sparsity prior of the standard Lasso only controls the number of non-zero 
entries of ^f. However, to really demonstrate the benefits of our method when compared to the 
performances of the standard Lasso, it is required to also derive lower bounds. This issue is a 
difficult task which has been considered in few papers and that is beyond the scope of this paper. 
For recent work in this direction, we refer to [Hu ang and Zhang, 2010] for regression models or 
|Lounici et al., 2011] and |Lounici et al., 2 009 for linear regression and multi-task learning. 

However, the analysis in |Huang and Zhang, 2010 Lounici et al., 201 1| of Group Lasso reg- 
ularization is carried out the setting of multiple regression models where the parameters 
to estimate are vectors and with error terms that are centered. Therefore, the results in 
|Huang and Zhang, 2010 Lounici et al., 201 1| cannot be applied to the matrix regression model 
(|2.4p since, in our setting, the parameter to estimate is the matrix I] and the error terms Uj+Wj 
in (12.41) are not centered. 
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4 Numerical experiments and an application to sparse PCA 



In this section we present some simulated examples to illustrate the practical behaviour of the 
covariance matrix estimator by group Lasso regularization proposed in this paper. In particular, 
we show its performances with an application to sparse Principal Components Analysis (PCA). 
In the numerical experiments, we use the explicit estimator described in Proposition [1] in the 
case M = n and an orthogonal design matrix G, and also the estimator proposed in the more 
general situation when n < M. The programs for our simulations were implemented using the 
MATLAB programming environment. 

4.1 Description of the estimating procedure and the data 

We consider a noisy stochastic processes X on T = [0, 1] with values in R observed at fixed 
location points t%, ...,t n in [0, 1], generated according to 

X(t j ) = X°(t J )+ae j , j = l,...,n, (4.1) 

where a > is the level of noise, e\, . . . ,e n are i.i.d. standard Gaussian variables, and X° is 
a random process independent of the e,-'s. For the process X° we consider two simple models. 
The first one is given by 

X°(t) = af(t), (4.2) 

where a is a Gaussian random coefficient such that Ea = 0, Ea 2 = -y 2 , and / : [0, 1] — > R is an 
unknown function. The second model for X° is 

X°(t) = ai/i(t) + a 2 / 2 (i), (4.3) 

where a± and a 2 are independent Gaussian variables such that Eai = Ea 2 = 0, Ea 2 = jf, 
Ea| = 7 2 (with 71 > 72), and /i,/ 2 : [0,1] — > R are unknown functions. The simulated data 
consists in a sample of N independent observations of the process X at the points t\,...,t n , 
which are generated according to (|4.ip . Therefore, throughout the numerical experiments, one 
has that 

In model f)4.2[) . the covariance matrix £ of the process X° at the locations points is given 
by S = 7 2 FF T , where by definition 

F=(/(t 1 ),...,/(t 1 )) T GR". 

Note that the largest eigenvalue of S is 7 2 ||F||| 2 with corresponding eigenvector F. We suppose 
that the signal / has some sparse representation in a large dictionary of basis functions of size M, 
given by {g m , m = 1, . . . , M}, meaning that / (t) = J2m=i Pmg m it) , with J* = {m, /3 m / 0} 
of small cardinality s*. Then, the process X° can be written as X°(t) = J2m=i a fim9rn (t) , 
and thus 5] = 7 2 G\l/j*G T , where is an M x M matrix with entries equal to f3 m Pm' for 
1 < m, m' < M. 
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Similarly, in model (|4,3p . the covariance matrix S of the process X° at the locations points 
is given by S = 7 2 FiFj~ + 7IF2FJ, where by definition 

Fi = (fx (ti) , / (ti)) T G R n and F 2 = (/ 2 (t x ) , / (h)) T G K". 

In the following simulations, the functions f% and / 2 are chosen such that Fi and F 2 are or- 
thogonal vectors in R n with ||Fi||^ 2 = 1 and ||F 2 ||^ 2 = 1. Under such an assumption and since 
71 > 72, the largest eigenvalue of X! is -yf with corresponding eigenvector Fi, and the second 
largest eigenvalue of S is 7 2 with corresponding eigenvector F 2 . We suppose that the signals f\ 
and fi have some sparse representations in a large dictionary of basis functions of size M, given 
by fx (*) = Em=i Pm9m (*) , and / 2 (i) = £m=l Pm9m (*)■ Then, the process X° can be written 
as = E^LlK/^m + o^m^m it) and thus £ = G( 7l 2 * 1 + 7 2 * 2 )G T , where V 1 ,* 2 are 

M x M matrix with entries equal to fimiPm)' and (3m(Pm)' f° r 1 < m, m' < M respectively. 

In models (|4.2j) and ()4.3p . we aim at estimating either F or Fi,F 2 by the eigenvectors 
corresponding to the largest eigenvalues of the matrix 5]j defined in (j3. 18|) . in a high-dimensional 
setting with n > N and by using different type of dictionaries. The idea behind this is that 
Sj is a consistent estimator of Sj* (see its definition in I3.19j) in operator norm. Although the 
matrices X! j* and X may have different eigenvectors (depending on the design points and chosen 
dictionary), the examples below show the eigenvectors of Xlj can be used as estimators of the 
eigenvectors of S. 

The estimator XIj of the covariance matrix S is computed as follows. Once the dictionary 
has been chosen, we compute the covariance group Lasso (CGL) estimator = G\I/^G T , 
where is defined in (|2.7|) . We use a completely data-driven choice for the regularizarion 

parameter A, given by A = ||£„ ise||2 (l + + ^/ 2l51 ^ M ^ , where ||£ no j se || 2 = a 2 is the 

median absolute deviation (MAD) estimator of a 2 used in standard wavelet denoising (see e.g. 
Antoniadis et al., 2001 1 ) and 5 = 1.1. Hence, the method to compute 5]^ is fully data-driven. 



Furthermore, we will show in the examples below that replacing A by A into the penalized 
criterion yields a very good practical performance of the covariance estimation procedure. 

As a final step, one needs to compute the estimator 5] j of XI, as in ()3.18p . For this, we 
need to have an idea of the true sparsity s*, since J defined in ()3.14p depends on s* and also 
on unknown upper bounds on the level of noise in the matrix regression model (|2.8p . A similar 
problem arises in the selection of a sparse set of active variables when using regularization by 
a Lasso penalty in the standard linear model. As an example, recall that in |Lounici,~2 008 , 
a second thresholding step is aso used to estimate the true sparsity pattern. However, the 
suggested thresholding procedure in |Lounici, 2008| also depends on a priori unknown quantities 
(such as the amplitude of the coefficients to estimate). To overcome this drawback in our case, 
we can define the final covariance group Lasso (FCGL) estimator as the matrix 

£./ gA/Gj ( 4 - 4 ) 

with J = J e = < k : > ^ n where e is a positive constant. To select an appropriate value 

of e, one can plot the cardinality of J e as a function of e, and then use an L-curve criterion to 
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only keep in J the indices of the columns of with a significant value in ^-norm. This choice 
for J is sufficient for numerical purposes. 

In the simulations, to measure the accuracy of the estimation procedure, we also use 
the empirical average of the Frobenius and operator norm of the estimators and £j 



p 

with respect to the true covariance matrix £ defined by EAFN = p E 

P =i 

p 



s 5 



and 

F 



EAON = i £ 
P =i 



respectively, over a number P of iterations, where S- and S - are 
2 A J 



the CGL and FCGL estimators of S, respectively, obtained at the p-th iteration. We also com- 
pute the empirical average of the operator norm of the estimator £j with respect to the matrix 

p 



p 

£j*, defined by EAON* = p E 

p=i 



J J 



4.2 Model (14. 2 p - case of an orthonormal design (with n = M) 

First, the size of the dictionary M as well as the basis functions {g m ,m = 1, ...,M} have 
to be specified. In model (|4,2p . we will use for the test function / the signals HeaviSine and 
Blocks (see e.g. |Antoniadis et al., 2001 for a definition), and the Symmlet 8 and Haar wavelet 



basis for the HeaviSine and Blocks signals respectively, which are implemented in the Matlab's 
open-source library WaveLab (see e.g. |Antoniadis et aL"7"2 001 1 for further references on wavelet 
methods in nonparametric statistics). Then, we took n = M and the location points t\,...,t n 
are given by the equidistant grid of points tj = j = 1, . . . , M such that the design matrix G 
(using either the Symmlet 8 or the Haar basis) is orthogonal. 

Figures 1, 2, and 3 present the results obtained for a particular simulated sample of size 
N = 25 according to (|4.ip . with n = M = 256, a = 0.015, 7 = 0.5 and with / being either the 
function HeaviSine or the function Blocks. It can be observed in Figures 1(a) and 1(b) that, as 
expected in this high dimensional setting (N < n), the empirical eigenvector of S associated to 
its largest empirical eigenvalue does not lead to a consistent estimator of F. 

The CGL estimator is computed directly from Proposition [TJ In Figures 2(a) and 2(b), 
we display the eigenvector associated to the largest eigenvalue of as an estimator of F. Note 
that this estimator behaves poorly. The estimation considerably improves by taking the FCGL 
estimator Sj defined in (|4.4p . Figures 3(a) and 3(b) illustrate the very good performance of the 
eigenvector associated to the largest eigenvalue of the matrix Sj as an estimator of F. 

It is clear that the estimators and £j are random matrices that depend on the observed 
sample. Tables 1(a) and 1(b) show the values of EAFN, EAON and EAON* corresponding 
to P = 100 simulated samples of different sizes N and different values of the level of noise a. 
It can be observed that for both signals the empirical averages EAFN, EAON and EAON* 
behaves similarly, being the values of EAON smaller than its corresponding values of EAFN 
as expected. Observing each table separately we can remark that, for N fixed, when the level 
of noise a increases then the values of EAFN, EAON and EAON* also increase. By simple 
inspection of the values of EAFN, EAON and EAON* in the same position at Tables 1(a) and 
1(b) we can check that, for a fixed, when the number of replicates N increases then the values 
of EAFN, EAON and EAON* decrease in all cases. We can also observe how the difference 
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between EAON and EAON* is bigger as the level of noise increases. 



Table 1(a). Values of EAFN, EAON and EAON* corresponding to signals 
HeaviSine and Blocks for M = n = 256, N = 25. 



Signal 


a 


0.005 


0.01 


0.05 


0.1 


0.5 


1 


HeaviSine 


EAFN 


0.0634 


0.0634 


0.2199 


0.2500 


0.2500 


0.2500 


HeaviSine 


EAON 


0.0619 


0.0569 


0.1932 


0.2500 


0.2500 


0.2500 


HeaviSine 


EAON* 


0.0619 


0.0569 


0.1943 


0.2600 


0.5000 


1.2500 


Blocks 


EAFN 


0.0553 


0.0681 


0.2247 


0.2500 


0.2500 


0.2500 


Blocks 


EAON 


0.0531 


0.0541 


0.2083 


0.2500 


0.2500 


0.2500 


Blocks 


EAON* 


0.0531 


0.0541 


0.2107 


0.2600 


0.5000 


1.2500 



Table 1(b). Values of EAFN, EAON and EAON* corresponding to signals 
HeaviSine and Blocks for M = n = 256, N = 40. 



Signal 


a 


0.005 


0.01 


0.05 


0.1 


0.5 


1 


HeaviSine 


EAFN 


0.0501 


0.0524 


0.1849 


0.2499 


0.2500 


0.2500 


HeaviSine 


EAON 


0.0496 


0.0480 


0.1354 


0.2496 


0.2500 


0.2500 


HeaviSine 


EAON* 


0.0496 


0.0480 


0.1366 


0.2596 


0.5000 


1.2500 


Blocks 


EAFN 


0.0485 


0.0494 


0.2014 


0.2500 


0.2500 


0.2500 


Blocks 


EAON 


0.0483 


0.0429 


0.1871 


0.2500 


0.2500 


0.2500 


Blocks 


EAON* 


0.0483 


0.0429 


0.1893 


0.2600 


0.5000 


1.2500 



4.3 Model (14. 3 p - the case M = 2n by mixing two orthonormal basis 

Consider now the setting of model (|4.3p with 71 = 0.5, 72 = 0.2, a = 0.045, N = 25 and an 
equidistant grid of design points t\, ...,t n given by tj = ^, j = 1, . . . , n with n = 128. For the 
signals /1 and fa we took the test functions displayed in Figure 4(a) and 4(b). Obviously, the 
signal fi has a sparse representation in a Haar basis while the signal has a sparse representation 
in a Fourier basis. Thus, this suggests to construct a dictionary by mixing two orthonormal 
basis. More precisely, we construct a n x n orthogonal matrix G 1 using the Haar basis and a 
n x n orthogonal matrix G 2 using a Fourier basis (cosine and sine at various frequencies) at 
the design points. Then, we form the n x M design matrix G = [G 1 G 2 ] with M = 2n. The 
CGL estimator is computed by the minimization procedure (|2.7p using the Matlab package 
minConf of |Schmidt et al., 2008| . 

In Figures 5(a) and 5(b), we display the eigenvector associated to the largest eigenvalue of 

as an estimator of Fi, and the eigenvector associated to the second largest eigenvalue of St 
as an estimator of F2. Note that these estimators behaves poorly. The estimation considerably 
improves by taking the FCGL estimator Sj defined in (|4.4|) . Figures 6(a) and 6(b) illustrate 
the very good performance of the eigenvectors associated to the largest eigenvalue and second 
largest eigenvalue of the matrix Sj as estimators of Fi and F2. 

Finally, to illustrate the benefits of mixing two orthonormal basis, we also display in Figures 
7 and 8 the estimation of Fi and F2 when computing the matrix Sj by using either only the 
Haar basis (i.e. G = G 1 and M = n) or only the Fourier basis (i.e. G = G 1 and M = n). The 
results are clearly much worse and not satisfactory. 
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4.4 Model (14. 2 p - case of non equispaced design points such that n < M 

Let us now return to the setting of model ()4.2p . The test functions / are either the signal 
HeaviSine and or the signal Blocks. We also use the Symmlet 8 and Haar wavelet basis for the 
HeaviSine and Blocks functions respectively. However, we now choose to take a setting where 
the number of design points n is smaller than the size M of the dictionary. Taking n < M, 
the location points are given by a subset {t±, ...,t n } C {jj : k = 1, ...,M} of size n, such that 
the design matrix G is an n x M matrix (using either the Symmlet 8 and Haar basis). For 
a fixed value of n, the subset {t\, ...,t ra } is chosen by taking the first n points obtained from 
a random permutation of the elements of the set {jj, jj, 1}. Figures 9 and 10 present the 
results obtained for a particular simulated sample of size N = 25 according to (|4.ip . with n = 90, 
M = 128, a = 0.02, 7 = 0.5 and with / being either the function HeaviSine or the function 
Blocks. It can be observed in Figures 9(a) and 9(b) that, as expected in this high dimensional 
setting (N < n), the empirical eigenvector of S associated to its largest empirical eigenvalue 
are noisy versions of F. As explained previously, the CGL estimator is computed by the 
minimization procedure (|2.7|) using the Matlab package minConf of Schmidt et al., 2008 . In 



Figures 10(a) and 10(b) is shown the eigenvector associated to the largest eigenvalue of as 
an estimator of F. Note that this estimator is quite noisy. Again, the eigenvector associated to 
the largest eigenvalue of the matrix defined in (|4.4p is much a better estimator of F. This is 
illustrated in Figures 11(a) and 11(b). To compare the accuracy of the estimators for different 
simulated samples, we compute the values of EAFN, EAON and EAON* with fixed values of 
a = 0.05, M = 128, N = 40, P = 50 for different values of the number of design points n. For 
all the values of n considered, the design points t\,...,t n are selected as the first n points obtained 
from the same random permutation of the elements of the set {jj, jj, 1}- The chosen subset 
{ti, ...,t n } is used for all the P iterations needed in the computation of the empirical averages 
(fixed design over the iterations). Figure 12 shows the values of EAFN, EAON and EAON* 
obtained for each value of n for both signals HeaviSine and Blocks. It can be observed that the 
values of the empirical averages EAON and EAON* are much smaller than its corresponding 
values of EAFN as expected. We can remark that, when n increases, the values of EAFN , 
EAON and EAON* first increase and then decrease, and the change of monotony occurs when 
n > N. Note that the case n = M = 128 is included in these results. 
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Orthonormal case - Model (|4.2p 

0.15 




-0.15 1 1 1 1 1 1 1 1 ' 1 

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 

Figure 1(a). Signal HeaviSine and 
Eigenvector associated to the largest 
eigenvalue of S 
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Figure 1(b). Signal Blocks and 
Eigenvector associated to the largest 
eigenvalue of S 
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Figure 2(a). Signal HeaviSine and 
Eigenvector associated to the largest 
eigenvalue of S-^ 
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Figure 3(a). Signal HeaviSine and 
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Figure 2(b). Signal Blocks and 
Eigenvector associated to the largest 
eigenvalue of 
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Figure 3(b). Signal Blocks and 



Eigenvector associated to the largesl^ Eigenvector associated to the largest 
eigenvalue of eigenvalue of Sj 



Case M = 2n (Haar + Fourier basis) 
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Figure 4(a). Signal F\ 
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Figure 4(b). Signal F 2 
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Figure 5(a). Signal F\ and Eigenvector 
associated to the largest eigenvalue of 



Figure 5(b). Signal F 2 and 
Eigenvector associated to the second 
largest eigenvalue of 



0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 
Figure 6(b). Signal F\ and 
Eigenvector associated to the largest 
eigenvalue of %j with G = [G 1 G 2 ] 




0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 



Figure 6(b). Signal F2 and Eigenvector 
associated to the second largest 
eigenvalue of £ j with G = [G 1 G 2 ] 
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Orthonormal case M = n (Haar) 



' 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 



0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 



Figure 7(a). Signal F 1 and Eigenvector Figure 7(b). Signal F 2 and 

associated to the largest eigenvalue of Eigenvector associated to the second 

largest eigenvalue of with G = G 1 



S T with G = G 



Orthonormal case M = n (Fourier) 




Figure 8(a). Signal F\ and Eigenvector Figure 8(b). Signal F 2 and 

associated to the largest eigenvalue of Eigenvector associated to the second 
Y,j with G = G 2 largest eigenvalue of £ j with G = G 2 
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Non equi-spaced points with n < M 




Figure 9(a). Signal HeaviSine and 
Eigenvector associated to the largest 
eigenvalue of S 




Figure 10(a). Signal HeaviSine and 
Eigenvector associated to the largest 
eigenvalue of 




Figure 11(a). Signal HeaviSine and 
Eigenvector associated to the largest^ 
eigenvalue of X j 
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Figure 9(b). Signal Blocks and 
Eigenvector associated teethe largest 
eigenvalue of S 

















1 
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Figure 10(b). Signal Blocks and 
Eigenvector associated to the largest 
eigenvalue of 
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' 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 

Figure 11(b). Signal Blocks and 
Eigenvector associated to the largest 
eigenvalue of I] j 




A.l Notations 



First let us introduce some notations and properties that will be used throughout this Ap- 
pendix. The vectorization of a p x q matrix A = (ciij)i<i<p,i<j<q is the pq x 1 column vector 
denoted by vec (A), obtain by stacking the columns of the matrix A on top of one another. That 
is vec(A) 
and B = 
denoted by A 



[an, ...,a p i,a 12 ,...,a P 2, ...,ai q , ...,a pq ] T . If A = {aij)i<i< k ,i<j<n is a fc x n matrix 
ij)i<i<p,i<j<q 1S a P x 1 m &trix, then the Kronecker product of the two matrices, 



B, is the kp x nq block matrix 

"onB 



A <g) B 



ai„B 



OfeiB . . . afc„B 



In what follows, we repeatedly use the fact that the Frobenius norm is invariant by the vec 
operation meaning that 

||A||^ = || Uec (A)||| 2 , (A.l) 

and the properties that 

vec (ABC) = (C T ® A) vec (B) , (A.2) 

and 

(A tg) B)(C tg) D) = AC (g) BD, (A.3) 
provided the above matrix products are compatible. 
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A. 2 Proof of Proposition Q] 

Lemma 1 Let \l/ = ^\ denotes the solution of (|2.7p . Then, for k = 1, 



(G G) T vec(S) - (G (8) G)vec(*; 



km 



(G G) T uec(S) - (G ® G)vec(*; 



< A7 fc 



«2 



where *&k denotes the k-th column of the matrix & and the notation 
Wk,m)m=l,...,M in R M for a vector j3 = (Pk,m)k,m=l,...,M G 



.,M 

if ^ 

</ = 
l fc denotes the vector 



Proof of Lemma [T] For * G 



»MxM 



define 



L(*) 



S-G*G 



T 



uec(S) - (Gt»G)uec(*; 



and remark that \I/ is the solution of the convex optimization problem 



^ = argmin < 



M 



k=l \ m=l 



M 



E* 



2 

mk 



It follows from standard arguments in convex analysis (see e.g. [Boyd and Vandenberghe, 2004Q, 
that ^ is a solution of the above minimization problem if and only if 





M \ 




J2^mk) 




m=l J 



where VL(\I/) denotes the gradient of L at \I/ and d denotes the subdifferential given by 

M 



k i v=i y 1 11 11 - 

where 0/t denotes the fc-th column of G y^MxM wi^h completes the proof. 



M 



E* 



2 



G 



pMxM . 



: ©fc = 7fcl 



if * fe ^0,||e fc || /a < 7 fc if *fc = 



□ 



Now, let \I/ G 5a/ with M = n and suppose that G T G = I n . Let Y = {Y m k)i<m,k<M = 
G T SG and remark that vec(Y) = (G G) T v ec(S). Then, by using Lemma [1] and the fact 
that G T G = I n implies that (G ® G) T (G®G) = I n a, it follows that * = * A satisfies for 
= 1, . . . , M the following equations 



*fe 1 + 



Y fc for all ^ 0, 
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and 

M 



x Y, Y lk< for all = 0. 

\ m=l 



where = {^ m k)i<m<M G K M and Y fc — ("Y mk)i<m<M G R^, which implies that the solution 
is given by 



which completes the proof of Proposition [TJ □ 
A. 3 Proof of Proposition [2] 



First suppose that X is Gaussian. Then, remark that for Z = ||X||^ 2 , one has that ||^||^ 2 < 

1/2 

01 



-oo which implies that ||^||^ 2 = ||Z 2 ||i . Since Z 2 = X^=i l^(*i)| 2 ^ follows that 



?1 



where 2Tj = i = 1, . . . , n and S# denotes the ith diagonal element of S. Then, the result 

follows by noticing that [|y||^ < \/8/3 if Y ~ N(0, 1). The proof for the case where X is 

— 1/2 

such that 1 1 Z 1 1 ^ < +oo and there exists a constant C\ such that ||S^ ^i||^ 2 — Ci fo r an 
i = 1, . . . , n follows from the same arguments. 

Now, consider the case where X is a bounded process. Since there exists a constant R > 
such that for all t G T, < f?, it follows that for Z = ||X||^ 2 then Z < ^/nR which 

implies that for any a > 1, ||-Z||^ Q < ^/nR(log2)~ 1 / a , (by definition of the norm ||Z||^ a ) which 
completes the proof of Proposition [21 □ 

A. 4 Proof of Proposition 3] 

Under the assumption that X = X°, it follows that £ = G**G T with = E(aa T ), 
where a is the random vector of M M with a m = a m for m G J* and a m = for m ^ J*. 
Then, define the random vector aj* G ~R J * whose coordinates are the random coefficients a m 
for m G J*. Let *j* = E(aj*aJ»). Note that £ = Gj**j*Gj* and S = Gj**j*Gj», with 
= YliLi a j*( a j*) T ' where aj» G M" 7 * denotes the random vector whose coordinates are 
the random coefficients a l m for m G J* such that X; L {t) = ^meJ* a m9m(t), t G T. 

Therefore, \l/ j* is a sample covariance matrix of size x and we can control its deviation in 
operator norm from \l/ j* by using Proposition^ For this we simply have to verify conditions sim- 
ilar to (Al) and (A2) in Assumption [2] for the random vector aj* = (Gj»Gj*) _1 Gj»X G M. s *. 

First, let /3 G W* with \\/3\\e 2 = 1. Then, remark that aJ„/3 = X T ,S with /3 = Gj» (Gj.Gj*) -1 /3. 

~ — 1 /2 

Since ||/3||| 2 < (p m - m (Gj*Gj*)) and using that X satisfies Assumption [2] it follows that 

(E|aJ,/3| 4 ) 1/4 < p(S)p"f (Gj.G,.) . (A.4) 
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Now let Z = ||aj*||^ 2 < p m |,{ 2 (Gj*Gj*) ||X||^ 2 . Given our assumptions on X it follows that 
there exists a > 1 such that 



l^lk < Pjn (gJ.Gj* ) \\Z\\ fa < +OC, 



(A.5) 



where Z = ||X||£ 2 . Hence, using the relations (jA.4j) and (|A.5p . and Proposition (with aj* 
instead of X), it follows that there exists a universal constant (5* > such that for all x > 0, 



J* 



*7* 



^ ^*,Af,s»,l X ) ^ eX P 1 X) 2 + 



where f^jv^i = max(A 2 d * N s ^ v B d * jN , s *,l), with A^s,,! = ||Z||^^^ ! '"^ A — 



Bd*,N, St ,, 



'N 



' N 



+ ||*J* 



1 1/2 



* 1 1 2' Ai*,iV,s*,l 



and <i* = min(./V, s*). Then, using the 



inequality ||S — S||2 < Pmax (G»J,Gj*) — *j*||2j it follows that 

|S — S||2 > Pmax ( Gj.Gj. ) fd*,N,s*,lX , 



< 



Pmax I G J» G j 



J* 



^ Pmax (Gj»Gj» ) Trf. jv s iX 



^ exp (-(CM^" 
Hence, the result follows with 



TiV.s, = Pmax I Gj*Gj» ) f d *,jv., j 



max(p max (Gj*Gj*J A d ^ Nstl ,p max [Gj*Gj*J B d 
m&x(A2* jArSi , B d * ,n,s* ) , 



where I d * jJV)S , = (Gj*Gj*) 



yiog ^(logAf) 1 /" 



J* II2 



G J* G j* ) GT* SG./» ( GT* Gj* 



and, using the inequality 
-1 /W 

'min 



^ Pmin G J* G J* 



12 ' 



5, 



iax(Gj» Gj* ) 

ftnin(Gj,Gj») J VN ' \ p mi „(Gj„ Gj.) 



1 (GJ 1 G £1 )\ p2(S) 



+ 



1/2 



I ^ II 1/2 , 



A.5 Proof of Theorem Q] 

Let us first prove the following lemmas. 
Lemma 2 Let 81, ...,£jv fre independent copies of a second order Gaussian process E with zero 

N 

mean. Let W = j? Wj toit/i 

i=l 



and Si = (Si(ti) ,...,Si(t n )) , i = 1, ...,iV. 



30 



Suppose that Yl no i se = E(Wi) is positive-definite. For 1 < k < M, let % be the k-th column of 
the matrix G T WG. Then, for any x > 0, 



yVkWh > \\G k \\e 2 Y /, pm ax (GG T )||S noise ||2 ^1 + + j 



< exp(— x). 



Proof of Lemma [2} by definition one has that ||??fc||| 2 = G^TWGGWGfc where G& denotes 
the k-th column of G. Hence 

\\Vk\\l < ||G fc ||| 2 p max (GG T )||W||2. (A.6) 

— 1/2 

Using the assumption that 'Snoise is positive-definite define the random vectors Z{ = Y] no } se £i,i = 
1, . . . ,n. Note that the Zj's are i.i.d. Gaussian vectors in W 1 with zero mean and covariance 
matrix the identity. Then, define the N x n matrix 



r = 1 



( ZJ 

Since T is a matrix with i.i.d. entries following a Gaussian distribution with zero mean 

and variance 1/N, it follows from the arguments in the proof of Theorem 11.13 in 
|Davidson and Szarek, 2001| that for any x > 

|r T r|| 2 > (\ + ^ + | < exp(-x). (A.7) 

1/2 T 1/2 i ii T 

Now, since W = Y,^ oise T rE n ' oise it follows that ||W|| 2 < ||S no ise||2||r r|| 2 . Hence, inequality 
(|A.7p implies that for any x > 

|W|| 2 > ||E noise || 2 ^1 + ^ + ^] | < exp(-x), 

and the result finally follows from inequality (|A,6j) . □ 

Lemma 3 Let 1 < s < min(n,M) and suppose that Assumption^ holds for some cq > 0. Let 
J C {1, . . . , M} be a subset of indices of cardinality \ J\ < s. Let A G Sm and suppose that 

X, \\ A k\\e 2 < c o^2\\A k \\e 2 , 

fceJ c fceJ 

where A^ denotes the k-th column of A.. Let 

Pmin(s) 2 " C e(G) PmSLX (G T G)s" 
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Then, 



GAG 



,c f . \\^J\\F> 



where Aj denotes the M x M matrix obtained by setting to zero the rows and columns of A 
whose indices are not in J. 

Proof of Lemma [3} first let us introduce some notations. For A E Sm and J C {1, . . . , M}, 
then Ajc denotes the M x M matrix obtained by setting to zero the rows and columns of A 
whose indices are not in the complementary J c of J. Now, remark that 



GAG 



> 



GA/G 



GAjG 



+ 



GAjcG 



+ 2tr ( GAjG T GAjcG T 



+ 2tr ( GA jG GAjcG 



(A. 



Let A = GA,/G T and B = GAjcG T . Using that tr (A T B) = vec(A) 1 vec(B) and the 
properties (jA.ip and (|A.3j) it follows that 

tr (GAjG T GAjcG T ) = «ec(Aj) T (g t G ® G t g) wec(Ajc). (A.9) 

Let C = G T G ® G T G and note that C is a M 2 x M 2 matrix whose elements can be written in 
the form ofMxM block matrices given by 

C i:j = (G T G) ii G T G, for 1 < i, j < M. 

Now, write the M 2 x 1 vectors v ec(Aj) and uec(Ajc) in the form of block vectors as uec(Aj) = 



[(A 



j)J]J<i<M 



and vec(Ajc) = [(Ajc)j ]^< M , where (Aj); G K (Ajc), G R for 1 < i, j < 



M. Using (|A79|) it follows that 



tr ( GA/G 1 GAjcG 



J<= 



= Yl (Aj)7C^(A JC ), 

l<ij<M 

«eJ jeJ c 

Now, using that |(G T G)ij| < 9(G) for i ^ j and that 

(A J )7G T G(A JC ),| < ||G(A J ) i || £2 ||G(A JC ),|| £2 < Pmax (G T G)||(A J ) l ||, 2 ||(A JC ) J ||, 2 , 
it follows that 

tr (gAjG t GAjcG t ) > — #(G)/) max (G T G) [J^ IKAj)^ J j ]T IKAj^U 

VieJ / \jeJ c 

Now, using the assumption that X]fceJ c l|Afc||£ 2 < cq X^fceJ II Afc||^ 2 it follows that 

tr (gAjG t GAjcG t ) > -c o 0(G)/w(G T G) ||(A J ) i || i2 > ) 

VieJ / 



> -c o 0(G)p max (G T G) S ||A J || 2 ,, 



(A.10) 
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where, for the inequality, we have used the properties that for the positive re- 
als q = ||(Aj)j|| & , i G J then (Y, ieJ Ci) < \J\J2ieJ c i ^ s Y^ie.j c i an d that 

E te j[i(Aj)ia=iiA J [ii.. 

Using the properties (jA.ip and (|A.2j) remark that 



GAjG 1 



= \\Gj®Gjvec(Aj)\\l 

> p mi n{Gj®Gj)\\vec(Aj)\\ 2 h 

> Pmin(s) 2 ||Aj||p , 



(A.ll) 



where vec(Aj) = [(A.j)J]J eJ . Therefore, combining inequalities (|A.8j) . (|A.10p and (jA.lip it 
follows that 



GAG 



which completes the proof of Lemma [3l 



i) 



(G'G)s IIA 



k J\\F: 



□ 



Let us now proceed to the proof of Theorem [TJ Part of the proof is inspired by results in 
|Bickel et al., 2009| . Let s < min(n,M) and *e5 M with M(^f) < s. Let J = {k ; ^ 0}. 
To simplify the notations, write \P = By definition of Sa = G^G T one has that 



S-G*G 



M 



S - G*G 



T 



M 



+ 2A^ 7fc ||* 



k\\e 2 - 



+ 2Xj2 r rk\\^k\\e 2 < 

k=l k=l 

Using the scalar product associated to the Frobenius norm (A, B) F = tr (A T B^j then 



(A.12) 



S - G*G 



Putting (jA~l~3l) in (iAl2l we get 



S + W-G*G 



|wi|i,+ 



S - G*G 



+ 2(W,S-G*G' 

F \ IF 



(A.13) 



M 



S - G*G 



+ 2A^ 7fc ||* fc ||^ 2 < 



k=l 



S - G*G 



+ 2(W,G(*-*)G 



M 

+2A^7 fc ||* fc ||f 2 . 

k=l 

For k = 1, . . . , M define the M x M matrix with all columns equal to zero except the 
k-th. which is equal to — \Ev Then, remark that 



M 



M 



M 



W,G * - * G 



k=i 



k=i 



= ^ (W, GA,G t ) f = 52 (G T WG, A k )^ = 52 vl^k - *a 

k=l 

M 

< 5~] WvkhA&k - "&k\U 2 , 



fc=i 
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where r\ k is the fc-th column of the matrix G T WG. Define the event 

M 



A=f] {2||?7fc||£ 2 < A 7fe } 



(A.14) 



k=l 



Then, the choices 



Ik = 2||Gfc||£ 2 W/3 max (GG ), A = || Enoise || 2 1 + A/TT + 



n / 2<5 log M 



JV V iV / 

and Lemma [2] imply that the probability of the complementary event A c satisfies 

M 



'(^ C )<^P(2|| % ||, 2 >A 7fc )<M 1 



-5 



k=l 



Then, on the event A one has that 



S - G$G T 


2 

< 


S - G*G T 




F 





M 



+ \^ 7fc||*fc - *fc||< a 
fe=l 

+2Aj^7fc (||* fc ||f 2 - ||*fc||/ a 



M 



k=l 



Adding the term A^ fe=1 7fc||*fc — *jfc||| 2 to both sides of the above inequality yields on the 
event A 



S - G*G 



+ Aj^ 7fc ||* fe - * k \\ t2 < 



k=i 



S - G*G 



M 



+2A^7 fc (||* & -* i .||^ 2 + ||* 



k\\l 2 - \\^k\\l 2 



k=l 



Now, remark that for all k ^ J, then \\*&k — ^kWh + ll^fcll-fe — ll^fcll^ = 0> which implies that 
on the event A 



S - G*G 



A^ 7A ,.||* fc - * fc ||^ 2 < 



fc=l 



S-G*G 



(A.15) 



+4A^7 fc ||* fc - * fc | 



«2 



< 



fceJ 



S-G*G 



2 

If 



(A.16) 



+4A V / M*) /^7fell*fc - 
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where for the last inequality we have used the property that for the positive reals Ck = JkW^k 
* fc || /3J keJ then (E fceJ c k f < M^) £ fceJ 4 
Let e > and define the event 



,-2 



I fceJ 



S-G*G 



(A.17) 



Note that on the event A D Af then the result of the theorem trivially follows from inequality 
(|A.15p . Now consider the event .An At (all the following inequalities hold on this event). Using 
(|A.15[> one has that 



Al 



^Yl 7fc||*fc-*kk <4(l + l/e)A^ 7fc ||§ fc -* 



k\\e 2 - 



(A.18) 



k=l 



fceJ 



Therefore, on A n Ai 



lk\\^k - *fc|k < (3 + 4/e)^7fc||* fc - VkWta- 
kg J fceJ 

Let A be the M x M symmetric matrix with columns equal to A& = 7& ( — ) > k 



l,...,M, and cq = 3 + 4/e. Then, the above inequality means that 



coE 



fee,/ ll A fclk 2 



and thus Assumption [T] and Lemma [3] imply that 



< 



«icoZ)7fcii*fc-*fciii < 

fceJ 



GAG 



<4GL,/w(G T G) G(*-*)G 



(A.19) 



Let 7ma X = 4G^ ax/ o max (G T G). Combining the above inequality with (|A.16|) yields 



S - G*G 



< 



< 



S - 


G*G T 


S - 


G*G T 



+ 4A^ iC 1 o7maxV / A^(*) 



+ 4A^, c 1 7max\/>'(*) 



G(* - *)G 
G§G T - S 



+ 



G*G 



Now, arguing as in [Bi ckel et al., 2009] , a decoupling argument using the inequality 2xy < 
bx 2 + b~ l y 2 with b > 1, x = 2\K~l j raax \/M{^f) and y being either G*G T - S 
||Gvl/G T — S|| „ yields the inequality 



or 



S - G*G 



1 b -±l 

F~ V b- 1 



S-G*G 



2 Rh 2 -y 2 

+ 7T 7r 2 X A 2 A^(^). 



(A.20) 



Then, taking b = 1 + 2/e and using the inequalities 



£ - G*G 



< 2 IIS - Slli + 



S - G*G 



of Theorem [TJ 



and ||S - G*G T ||^ < 2||S - H\\ 2 F + 2 ||S - G*G T ||^ completes the proof 

□ 
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A.6 Proof of Theorem H 



Part of the proof is inspired by the approach followed in Lounici, 2008 and 
|Lounici et al., 2009| . Note first that 



M 



max 7/% 
Kk<M 



e-2 



k=i 



(2 



Since G {*f? € 5a/ : M < s*}, we can use some results from the proof of Theorem (fTJ). 
On the event .An At, with A defined by (|A7[4j) and A\ defined by (|AT7l) . inequality ([A~T8]) 
implies that 



- ii~ / 1\ 11 - 

< 4 1+- 

fe=l 2 ^ 7 fceJ* 



< 4(1 + 1) ^ 

v 7 V fceJ* 



Let A* be the M x M symmetric matrix with columns equal to At = 7^ ( — \l/t ) , k 



1, . . . ,M, let 7 m ax = 2G max y / p max (G T G) and Co = 3 + 4/e. Then, the above inequality and 
(jA.19p imply that on the event AnAi 



M 



< 



4(l + i)v^ 



jfc=i 



4(l + e)^ 



7i 



GA*G 



< l(l±I)v^ y 



V S*,CQ 



G * - ** G 



,co 



< 4 ^ + ^ ^ 7maxV / ^V / g0 M, iV, S», S,**, G, S T 



Then, using (|A.15p one has that on the event A n Af 

M 



J^7fc *k~*J 



fc=i 



< 



l + e 



£2 A 



S-G**G 



Therefore, by definition of C\, the previous inequalities imply that on the event A (of probability 
1 - M l ~ s ) 



< Ci(n,M,AT, s*,S,**,G,£ r 



(A.21) 



Hence max * fc — *t 
which proves the first assertion of Theorem [2 



< C\ (a, n, M, iV, s* , G, S no j se ) with probability at least 1 — M 



1-5 



36 



Then, to prove that J = J* we use that ^= 
k = l,...,M. Then, by (TOTll 



I* 



/l2 



< A. 
— sjn 



for all 



<5fc 









5k 



(2 



I* 



71 



k\\e 2 



< Ci{n,M,N, s*,S,**,G,£ nc 



which is equivalent to 



-Ci(n,M,iV,s,,S,**,G,S noise ) < 



^2 V" 



Z-'noist 

(A.22) 



If Jfc € J then -4= 



A. Ilvb* II 



> Ci(n,M,iV,s*,S,**,G,S noise ). Inequality ■% 
< C\ faMyN^t^^jG^nrise) from (1X221) imply that ;! 



^2 



J6. Ilvp* || > 



^2 



Ci (n, M, JV, s*, S,\I/*, G, E n0 j ge ) > 0, where the last inequality is obtained 



using that k G J. Hence [l*^!^ > and therefore fc G J* . If G J* then 



,.,!,_ / 0. Inequality -Ci (n, M, N, s*, S,**, G, S r 



- 



from (lA~22j) imply that 
2Ci(n,M, AT, s*,S,**,G,£ r 
tion (I3TTD on -^L ||* 



1.2 



Jth. II \b* || 



+ Ci(n,M,AT, S »,S,**,G,E noise ) > ^ > 
where the last inequality is obtained using Assump- 



k\\£ 2 - 



Hence 



^2 



> 2Ci (n,M,JV,s»,S,**,G,S r 



Ci (n, M, N, s*, S,**, G, £ noise ) = Ci (n, M, A 7 ', s*, S,**, G, E^e) and therefore I £ J. This 
completes the proof of Theorem [2j □ 



A. 7 Proof of Theorem [3] 

Under the assumptions of Theorem [3j we have shown in the proof of Theorem [2] that J = J* 
on the event A defined by (|A.14p . Therefore, under the assumptions of Theorem [3] it can be 
checked that on the event A (of probability 1 — M 1_<5 ) 



with 



Now, from the definition (|3.19p of E j* it follows that on the event A 



Gj.SGj* (gJ,Gj» 



Gj.Gj. 



1 



(A.23) 



where Aj* = *j. + (GT.G J .)- 1 Gj.S noiae Gj. (Gj.Gj.) . Let Y, = (Gj.Gj.) Gj.X; for 



1, . . . , N and remark that 



J* 



1 N 

-^Y,Y7 with E*j, = Aj.. 



i=i 
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Therefore, is a sample covariance matrix of size s* x s* and we can control its deviation in 
operator norm from Aj* by using Proposition [3j For this we simply have to verify conditions sim- 
ilar to (Al) and (A2) in Assumption [2] for the random vector Y = (Gj.Gj*) 1 Gj»X G M s *. 
First, let £ R s * with \\(3\\ h = 1. Then, remark that Y T /3 = X T f3 with ^ = Gj* (Gj.Gj*) -1 (3. 
Since ||/3||^ < (p min (Gj,Gj.))~ 1/2 it follows that 



Gj.Gj. 



(A.24) 



where p(£, £ noise ) = 8 1 / 4 (p 4 (S) + p 4 (E„)) 1/4 . Now^ let Z = ||Y|| /a < 

Pmil 2 (Gj*Gj») ll-^-lks- Given our assumptions on the process X = X + £ it follows that 
there exists a > 1 such that 



< 



-1/2 /W 



"a — rmm 



Gj*Gj. + ||W|UJ<+oo, 



(A.25) 



where Z = ||X||^ 2 and TV = ||£||^, with X = (X (h) , X (t n )y and £ = (£ (h) ,...,£ (t n )). 
Finally, remark that 



IA 



< II* 



J*ll 2 +P: 



+ PrJr, I Gj, Gj» 



J noise\\2 • 



(A.26) 



Hence, using the relations (|A.24j) and (|A.25j) . the bound ()A.26p and Proposition [3] (with Y 
instead of X), it follows that there exists a universal constant (5* > such that for all x > 0, 



J* - Aj 



^ tat iS ,x) ^ exp (-(<$■„ 1 x)z+«) , 



(A.27) 



where fjv, s » = max(A 2 Nst , B NtSt ), with Ajv )S * 

p 2 (S,S„ oise )p~?- n (G J» G j* ) /,, _ ,, i „ 
^ 1" UI V ^*ll2 +/ 5 min ^J»W* 

Then, define the event 



I v / I5j^(logjV) 1 /° 



and B 



N,s* 



M 2 



J nmse\\2) 



An,s*, with d* = rnin(AT, s*). 



B 



A, 



< f^A (log(M)) 



2 + q 



and note that, for x = 6* (log(M)) a with 5+ > 5*, inequality (|A.27j) implies that F (B) > 



1 7Ta 



l-M \. S *J . Therefore, on the event At~)B (of probability at least 1 — M 1-(5 — M 
using inequality (|A.23p and the fact that J = J* one obtains 



'Si, \ 



Gj.Gj. fjv, s A(log(M)) — 



which completes the proof of Theorem [3l 



□ 
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